This program is offered by Leanne Hyndman, the leader of Amber Community. It aims to extract useful information from the data provided by the team and by other contributors on the internet, and to provide data-driven support.
To achieve this aim, I cleaned the data from Amber Community, joined it with other data, carried out exploratory analysis and time series analysis, and compared the spatial distributions of population and referrals.
The data contains the following variables:
There are two main findings. First, the number of referrals over time shows weak weekly and monthly seasonality, which may be due to the regular schedule of staff. Second, the institution seems to have different levels of “attractiveness” to different regions: people who live in regional areas seem more likely to come to the institution.
The data is from three sources: Amber Community for the essential data, the R package called “adsmapsdata” for the map data, and the government’s website for demographics (address: https://www.coronavirus.vic.gov.au/victorian-coronavirus-covid-19-data).
Amber Community, formerly Road Trauma Support Services Victoria, is a not-for-profit organization contributing to the safety and well-being of road users.
They provide counselling and support to people affected by road trauma and address the attitudes and behaviours of road users through education. Their education programs aim to reduce the incidence of crashes, injuries and fatalities, and the associated trauma and grief.
Now, they are interested in a few questions and hope that we can provide some data-driven support, so that they can gain more insight into their meaningful work and contribute more to road users’ mental and physical health.
Summarise the data and note any changes over the period:
referred by
number of days between date received and date entered
client type
number of referrals per day/week/month
Identify any seasonality, if present
How do the location demographics compare to the population of VIC?
A tricky part here is comparing the demographics, for which I considered using the formula:
\[\frac{a_1}{a_2}-\frac{b_1}{b_2}\]
The data is well structured but somewhat messy, with a lot of missing values. With the help of my mentor Rob Hyndman, I removed unneeded variables, cleaned the variable names, and created the table referrals_clean1.
Then I checked the data type, name and value in each variable using the function glimpse():
After checking that, I renamed the variables, merged categorical values that are identical at a certain level, and created a new table called referrals_clean2.
Next, I created a new variable “day_of_week” to check whether more referrals are received on Monday. The result shows that Monday does receive more referrals than any other day of the week, which suggests possible weekly seasonality. This might be because some referrals made on weekends are postponed to the following Monday. Such a backlog can also explain why Tuesday has the second-largest number of referrals. The remaining weekdays (Wednesday, Thursday and Friday) have about the same number, which is still considerably larger (around 40%) than on a weekend day.
I had a look at the summary of the client_type variable:
The summary shows that most clients are drivers, witnesses and bereaved people.
I also had a look at the summary of the referred_by variable:
VPeR is clearly the major source of referrals.
Then I checked the data type, name and value in each variable:
## Rows: 8,037
## Columns: 8
## $ x1 <dbl> 393, 14, 16, 73, 318, 254, 257, 18, 122, 138, 173, 2…
## $ referral_id <dbl> 402, 18, 20, 77, 326, 261, 264, 22, 128, 144, 179, 2…
## $ date_received <date> 2016-06-18, 2016-07-01, 2016-07-01, 2016-07-03, 201…
## $ date_entered <date> 2016-08-19, 2016-07-04, 2016-07-05, 2016-07-04, 201…
## $ client_type <chr> "driver", "driver", "driver", "driver", "injured pas…
## $ referred_by <chr> "VPeR", "VPeR", "VPeR", "VPeR", "VPeR", "VPeR", "VPe…
## $ we_suggest_you_do <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ postcode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
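Since date_received and date_entered are both stored as dates, the number of days between them (one of the summaries asked for above) can be obtained by subtraction. The following is a minimal base R sketch, not the code used for this report; it uses only the first four date pairs printed in the output above, and the column name days_to_entry is my own illustrative choice.

```r
# First four date pairs as printed in the glimpse() output above
referrals <- data.frame(
  date_received = as.Date(c("2016-06-18", "2016-07-01", "2016-07-01", "2016-07-03")),
  date_entered  = as.Date(c("2016-08-19", "2016-07-04", "2016-07-05", "2016-07-04"))
)

# Subtracting two Date columns gives a difftime; convert it to plain numbers of days.
# days_to_entry is a hypothetical column name, not one from the original tables.
referrals$days_to_entry <- as.numeric(
  referrals$date_entered - referrals$date_received,
  units = "days"
)

referrals$days_to_entry
```

For these four rows the delays are 62, 3, 4 and 1 days, which suggests the delay can vary widely between referrals.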
After checking that, I renamed the variables, combined the categorical variables’ values, and created referrals_clean2.
Next, I created a new variable “day_of_week” to check whether more referrals are received on Monday:
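As an illustration of what combining categorical values can look like, here is a small base R sketch. The raw labels below are hypothetical (the original spellings are not shown in this report), and normalising case and whitespace is only one assumed way of merging equivalent values.

```r
# Hypothetical raw labels; the real category spellings are not shown in this report
raw_client_type <- c("Driver", "driver ", "Witness", "witness", "bereaved")

# Trim stray whitespace and lower-case the labels so equivalent values collapse
client_type <- tolower(trimws(raw_client_type))

# Count the merged categories
counts <- table(client_type)
counts
```

After this step the five raw labels collapse into three categories.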
| day_of_week | proportion (%) |
|---|---|
| Sun | 11.011572 |
| Mon | 17.357223 |
| Tue | 15.553067 |
| Wed | 14.657210 |
| Thu | 14.955829 |
| Fri | 14.806520 |
| Sat | 10.265024 |
| NA | 1.393555 |
Figure 1.1: Percentage of referrals received on each day of the week
According to Figure 1.1, there are more referrals on Monday than on any other day of the week. This might be because some referrals made on weekends are postponed to the following Monday. Such a backlog can also explain why Tuesday has the second-largest number of referrals. The remaining weekdays (Wednesday, Thursday and Friday) have about the same number, which is still considerably larger (around 40%) than on a weekend day.
We can also look at a summary of the client_type variable:
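A day-of-week breakdown like the one above can be sketched in base R as follows. This is an illustrative sketch rather than the code behind Figure 1.1: the dates are four values from the glimpse output plus one missing value, and the weekday is derived from the `wday` field of `as.POSIXlt()` so that the labels do not depend on the system locale.

```r
# A few received dates (taken from the glimpse output) plus one missing value
date_received <- as.Date(c("2016-06-18", "2016-07-01", "2016-07-01", "2016-07-03", NA))

# wday runs from 0 (Sunday) to 6 (Saturday), so shift by one to index the labels
day_labels  <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_of_week <- factor(day_labels[as.POSIXlt(date_received)$wday + 1],
                      levels = day_labels)

# Percentage of referrals per weekday, keeping missing dates as their own row
day_share <- round(100 * prop.table(table(day_of_week, useNA = "ifany")), 1)
day_share
```

Keeping the missing dates as a separate row mirrors the NA row in the table above, so the percentages still sum to 100.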
| client_type | n |
|---|---|
| driver | 2662 |
| witness | 2452 |
| bereaved | 1038 |
| fam/fr of casualty | 500 |
| other injured person | 428 |
| passenger | 428 |
| unknown | 272 |
| rider | 241 |
| other | 16 |
Table 1.1 above shows that most clients are drivers, witnesses and bereaved people, who together account for 6,152 of the 8,037 referrals (about 77%).
The referred_by variable can be summarised in the same way:
| referred_by | n |
|---|---|
| VPeR | 6876 |
| Self | 409 |
| VSA | 223 |
| unknown | 196 |
| TAC | 191 |
| Other | 55 |
| Family/friend | 37 |
| Police | 35 |
| Victims of Crime | 15 |
VPeR is clearly the major source of referrals, accounting for 6,876 of the 8,037 referrals (about 86%).
To compare the population distribution with the distribution of where referrals come from, I created a variable called attractiveness, which is computed by the formula below: \[\frac{a_1}{a_2}-\frac{b_1}{b_2}\] In this formula, \(a_1\) stands for the number of referrals in a certain postcode zone, \(a_2\) stands for the number of referrals in the whole area, \(b_1\) stands for the population in a certain postcode zone and \(b_2\) stands for the population in the whole area.
The reason I didn’t use \(a_1/b_1\) is a robustness problem when there are zero referrals. In some areas the number of referrals (\(a_1\)) is zero, so \(a_1/b_1\) is also zero. But the same “zero” can have different meanings, since the population of those areas (\(b_1\)) can vary a lot.
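The difference between the two measures can be seen in a small base R sketch. All numbers below are hypothetical and chosen only for illustration (they are not Amber Community or census figures), and the whole-area totals \(a_2\) and \(b_2\) are assumed to be the sums over the listed zones.

```r
# Hypothetical postcode zones; counts are made up for illustration only
zones <- data.frame(
  postcode   = c("P1", "P2", "P3", "P4"),
  referrals  = c(60, 40, 0, 0),          # a_1 for each zone
  population = c(5000, 3000, 200, 1800)  # b_1 for each zone
)

a2 <- sum(zones$referrals)   # referrals in the whole area
b2 <- sum(zones$population)  # population in the whole area

# Attractiveness: share of referrals minus share of population
zones$attractiveness <- zones$referrals / a2 - zones$population / b2

# The simple rate a_1 / b_1, shown for comparison
zones$rate <- zones$referrals / zones$population

zones
```

In this example the two zones without referrals both have a rate of exactly zero, whereas their attractiveness values are -0.02 and -0.18, so the larger zone with no referrals is flagged as more under-represented. The zones with referrals both score 0.1, and the attractiveness values sum to zero across zones by construction.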